Abstract

This paper studies how the generalization ability of neurons can be affected by
mutual processing of different signals. The study is based on a feedforward
artificial neural network. The mutual processing of signals can be a good
model of patterns in a set generalized by a neural network and in effect may improve
generalization. This paper argues that the interference may also cause a
highly random generalization. Adaptive activation functions are discussed as a way
of reducing that type of generalization. A test of a feedforward neural network is
performed that shows the discussed random generalization.
1 INTRODUCTION

A feedforward artificial neural network, further denoted by FNN, can be viewed as a rather
'unconstrained' structure: in a typical multilayered architecture the output of a neuron
in one layer is simply connected to all inputs in the succeeding layer, and the weights of
the connections can just be initialized randomly. The combination function of an artificial
neuron of the McCulloch and Pitts type treats all its arguments as equivalent,
simply adding them. In the process of training, attributes of the training observations
are propagated through such a relatively generic structure, possibly in a random order.
This raises several questions. How does that somewhat unconstrained structure of an artificial
neural network cope with generalization, especially when there are several 'competing'
stimuli that simultaneously 'want' to be 'extrapolated' onto 'regions' in the input space
of the FNN not covered by the training data? How can such conflicts destroy the
ability to generalize, and how can such phenomena be reduced?



2 RANDOM GENERALIZATION 

The summing of signals in the combination function of an artificial neuron, called here
an interference of signals, may improve generalization. For example, in the case of a
multi-dimensional data set, the processing of values from one input of a neural network can
be influenced by values at another input of the neural network, which may model well the
patterns in the training set. The error-minimizing learning process can prevent harmful
interference if the interference would increase the error with which the neural network
approximates the training set. The signals propagated from attributes of observations that are
absent in the training set, however, can interfere with no effect on the error. Therefore,
the interference can decrease the generalization ability of the network. A decrease of
generalization quality in neural networks can also be an effect of overfitting (Schaffer,
1991; Rosin and Fierens, 1995; Lawrence et al., 1997; Lawrence and Giles, 2000). Yet the
worsening of generalization caused by the discussed interference can be very different
from that caused by overfitting. While excessive fitting of the neural network function to
the training set means only that some particular patterns of the set are memorized, the
discussed interference of signals may introduce highly random changes to the generalizing
function of the neural network.

Let us discuss this type of random generalization in more detail.



3 STRONG PROPAGATION REGIONS 

In this section the so-called strong propagation regions in the input spaces of neurons
are discussed. The notion is used further in this paper to describe the
interference of signals.

A neuron with linear weight functions and a hyperbolic tangent activation function
has its output value equal to a given value r for input values that, in the neuron
input space, form a hyperplane P_r, except in the special case where all weights of the
neuron are equal to 0. Specifically, there is a hyperplane P_0 for the neuron output value
equal to 0. Because the hyperbolic tangent activation function has the greatest value
of its derivative at 0, the hyperplane P_0 is the region in the neuron input space for which
the propagation of signals through the neuron is the strongest. As the distance from
this hyperplane increases, the derivative of the activation function decreases and in effect
the propagation becomes weaker. Let us call the region with a relatively strong level of
propagation a strong propagation region. Let the region consist of points whose distances
to P_0 in the input space of the neuron do not exceed a certain value.
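
The definition above can be illustrated with a minimal sketch. It assumes a single tanh neuron with a weighted-sum combination function; the function names and the threshold parameter max_distance are hypothetical and chosen only for this example, since the text does not fix the "certain value".

```python
import math

def net_input(weights, bias, x):
    """Weighted sum of the inputs plus the bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def distance_to_p0(weights, bias, x):
    """Euclidean distance from x to the hyperplane P_0, where the net input is 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        raise ValueError("P_0 is undefined when all weights are zero")
    return abs(net_input(weights, bias, x)) / norm

def propagation_strength(weights, bias, x):
    """Derivative of tanh at the net input, 1 - tanh^2; it is largest on P_0."""
    y = math.tanh(net_input(weights, bias, x))
    return 1.0 - y * y

def in_strong_region(weights, bias, x, max_distance):
    """True if x lies within max_distance (an assumed threshold) of P_0."""
    return distance_to_p0(weights, bias, x) <= max_distance
```

For a fixed neuron, the derivative depends only on the distance to P_0, so a distance threshold and a derivative threshold describe the same kind of region.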

Let there be two fully connected subsequent layers L_i and L_{i+1} in a feedforward neural
network. Let there be N_i and N_{i+1} neurons in the layers, respectively. Let us discuss the
input spaces of the neurons in the layer L_{i+1}. Each of the neurons in the layer L_{i+1} has
N_i + 1 inputs, N_i of which are from the neurons in the preceding layer and a single input
is from the bias element. Therefore, the transformation made in the layer L_{i+1} can be
represented by N_{i+1} parameterized N_i-dimensional input spaces of the neurons in L_{i+1},
where the parameters in the spaces are the values of functions of the respective neurons
in L_{i+1}.

An example of input spaces of neurons in L_{i+1} is shown in Fig. 1. The lines represent
the hyperplanes P_0, denoted by P_j, j = 0, 1, ..., N_{i+1} - 1, where j denotes a respective
neuron in the layer L_{i+1}. This is not a full representation of the input spaces of the
neurons in the discussed layer, because the values of functions of the neurons are not
given, yet the diagram shows the regions with the strong propagation of signals, which lie
on and near the hyperplanes P_j. The values propagated to the neurons in the layer L_{i+1}
are either the direct values of attributes of observations if L_{i+1} is the first hidden layer,
or images of the attributes if L_{i+1} is any of the succeeding layers. In either case, the region r_t
of values propagated from the observations in the training set and the region r_g of values
propagated from the observations in the generalized set can be shown in the input spaces
of the neurons, as is done in Figure 1. In the example diagram, the region r_t consists
of two regions r_t^p, p = 1, 2, and the region r_g consists of another two regions r_g^q, q = 1, 2.
The regions are schematically shown as solid regions in the diagram, but they are sets of
discrete points, where each point corresponds to one or more observations.

Figure 1: An example diagram of input spaces of neurons in a layer.

Let each observation have its input attributes, that is, those that are propagated from
the inputs of a neural network, and its output attributes, that is, those that are compared
to the values at the outputs of the network. The hyperplanes P_j in the example diagram
generally concentrate in or near the regions r_t^p. This may happen during the training
process if there are relatively large differences between the values of output attributes
of observations whose input attributes are propagated through r_t^p. Thus, relatively high
values of derivatives of functions of the neurons in L_{i+1} may correspond to relatively
large differences between the output attributes of observations in the training set. The
hyperplanes P_j, by extending infinitely in the space, may allow for generalization to the
points outside r_t^p, including the points that are relatively far from r_t^p.

4 INTERFERENCE OF SIGNALS 

Let us discuss again the diagram of input spaces of neurons in Figure 1. Let there be
several hyperplanes P_j, denoted by P_j(i), where j determines a respective neuron and
i = 1, 2, that were placed during the learning process near r_t^i, to minimize the component
of the error caused by the observations in the training set whose attributes propagate through
r_t^i. They are marked in the diagram by solid lines for i = 1 and by dotted lines for
i = 2. Let the regions r_g^1 and r_g^2 overlap or be near to r_t^1 or r_t^2, respectively. Let
the observations whose input attributes are propagated through the regions r_g^1 and r_g^2 be
generalized well because of the hyperplanes P_j(1) and P_j(2), respectively. This is possible
because the hyperplanes P_j(1) extend from r_t^1 and the hyperplanes P_j(2) extend from r_t^2,
thus 'extrapolating' the patterns in the region r_t.

Now, if a hyperplane P_j(i), which normally generalizes the patterns in r_t^i, would by
chance 'intersect' the other training region, like P_0(1) does, it could possibly increase the
training error, and thus in possible further training the intersecting hyperplane P_j(i) could,
for example, be driven out of that region. Yet if the hyperplane would intersect the generalized
region associated with the other training region, like P_0(1) does, it could disturb the
generalization from that training region to its generalized region without any reaction in the
training process. Moreover, a region r_t^i could itself, during the training, be placed in the
generalized region associated with the other training region, thus causing all P_j(i)
associated with the generalization of r_t^i to disturb the generalization to that region.

The interference of signals, causing a possibly high randomness of generalization, could
be reduced if the strong propagation region of a neuron did not extend infinitely
in space. This is the case in radial basis function neural networks (Broomhead and Lowe,
1988; Moody and Darken, 1989; Poggio and Girosi, 1989). On the other hand, finite
strong propagation regions like those in radial basis function networks could worsen
the generalization ability of a neural network for sets where long strong propagation
regions are needed for good generalization. A possible method of finding a good trade-off
between infinite and finite strong propagation regions could be the use of adaptive activation
functions. Such adaptive activation functions could, during training with a special learning
algorithm, smoothly adapt their form, for example in the range between a radial basis
function and a hyperbolic tangent.
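
As a purely illustrative sketch of what "smoothly adapting the form" could mean, the code below blends a hyperbolic tangent with a Gaussian bump through a single shape parameter. The linear blend, the Gaussian form and the parameter name alpha are assumptions made for this example; the text does not specify the adaptive function or its learning algorithm. The sketch also does not model the distance-based combination function that a radial basis function network uses.

```python
import math

def adaptive_activation(x, alpha):
    """Hypothetical blend: alpha = 0 gives tanh(x), alpha = 1 gives exp(-x^2)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * math.tanh(x) + alpha * math.exp(-x * x)

def adaptive_derivative(x, alpha):
    """Derivative of the blend with respect to x, as a gradient-based update would need."""
    t = math.tanh(x)
    return (1.0 - alpha) * (1.0 - t * t) - 2.0 * alpha * x * math.exp(-x * x)
```

Because the blend is differentiable in alpha as well, a gradient-based learning rule could in principle adjust alpha together with the weights, although whether that yields a good trade-off is exactly the open question raised above.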



5 TESTS 



In some relatively simple generalization problems that were tested, the discussed
random generalization was observed rather rarely: after some time the trained
neural networks usually only began to overfit the data, showing only some
randomness connected with a limited flexibility. For this reason a relatively complex
training set is used in this test.




Figure 2: The data sets (a) θ_l, (b) θ_c and (c) the mask of the training subsets.



Let there be two three-dimensional sets θ_l and θ_c, as illustrated in Figures 2(a) and
2(b), respectively. The sets are 64 x 64 images, whose pixel coordinates determine the
neural network input vector values, a single value for each dimension, and whose pixel
brightnesses determine the corresponding values in the neural network output vectors. The
pixel at the lower left corner has the coordinates (-0.5, -0.5) and the pixel at the upper
right corner has the coordinates (0.5, 0.5). The brightness of the pixels represents the
range from -0.5 for black to 0.5 for white. Feedforward layered networks with two inputs,
a single neuron in the output layer and two hidden layers of 16 neurons each were trained
with the training subsets of either θ_l or θ_c. The neural networks had hyperbolic tangent
activation functions. There was a weight decay at a rate of 2·10^(…) to improve generalization
(Krogh and Hertz, 1993). Online training was used with a learning step of 0.02. The
training subsets are represented by the image in Figure 2(c). Black pixels in the image
mean that the corresponding pixels in Figures 2(a) and 2(b) belong to the training subsets
of the respective generalized sets.
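
The mapping from images to training observations described above can be sketched as follows. The sketch assumes 8-bit grayscale values (0 for black, 255 for white), images stored as lists of rows with row 0 at the bottom, and a mask in which 0 marks a training pixel; these storage details and all function names are assumptions of this example, not part of the described experiment.

```python
def pixel_to_input(col, row, size=64):
    """Map pixel indices to inputs: (0, 0) -> (-0.5, -0.5), (size-1, size-1) -> (0.5, 0.5)."""
    return (col / (size - 1) - 0.5, row / (size - 1) - 0.5)

def brightness_to_target(gray, max_gray=255):
    """Map brightness to the output range: black -> -0.5, white -> 0.5."""
    return gray / max_gray - 0.5

def training_samples(image, mask):
    """Collect (input, target) pairs for the pixels that the mask marks as black (0)."""
    size = len(image)
    samples = []
    for row in range(size):
        for col in range(size):
            if mask[row][col] == 0:
                samples.append((pixel_to_input(col, row, size),
                                brightness_to_target(image[row][col])))
    return samples
```

The pixels not selected by the mask form the part of the generalized set that the network never sees during training.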

There were four neural networks N_i^l, i = 0, ..., 3, trained with the subset of θ_l, and
another four neural networks N_i^c, i = 0, ..., 3, trained with the subset of θ_c. The generalizing
functions of the networks were sampled and the weights of the neurons in the first hidden
layer were saved at each of the iterations 10000000, 31622777 and 100000000. The
results are illustrated in Fig. 3. There is a table for each iteration in the figure, with
sampled generalization functions in the upper row and diagrams representing input spaces
of neurons in the first hidden layer in the lower row. The representation of the
generalization functions is analogous to that of the sets θ_l and θ_c. Each of the input space
diagrams shows with translucent lines the zeroes of the outputs of the first hidden layer neurons,
that is, it shows the hyperplanes P_j, against the common input values from the input
layer. The lower left corner of the dotted rectangles drawn within the diagrams represents
input values (-0.5, -0.5) and the upper right corner of the rectangles represents input
values (0.5, 0.5). Therefore, the input attributes of the observations in the sets θ_l and
θ_c are propagated into the space marked in the diagrams by the dotted rectangles. The
propagation to the first hidden layer is of course without any transformation, because the
nodes in the input layer only pass signals to the first hidden layer.

Figure 3: The generalizing functions and diagrams of the zeroes of the first hidden layer
neurons, for the networks N_i^l and N_i^c at iterations 10000000, 31622777 and 100000000.
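
For a network with two inputs, the zero of a first hidden layer neuron is a straight line in the input plane, so diagrams of this kind can be derived directly from the saved weights. The sketch below assumes each neuron is given as a hypothetical triple (w1, w2, b) with net input w1*x + w2*y + b; it clips a zero line to the square of input values and finds where two such lines cross.

```python
def zero_line_segment(w1, w2, b, lo=-0.5, hi=0.5):
    """Endpoints of the line w1*x + w2*y + b = 0 inside the square [lo, hi]^2, or None."""
    points = []
    if w2 != 0.0:
        for x in (lo, hi):  # intersections with the left and right edges
            y = -(w1 * x + b) / w2
            if lo <= y <= hi:
                points.append((x, y))
    if w1 != 0.0:
        for y in (lo, hi):  # intersections with the bottom and top edges
            x = -(w2 * y + b) / w1
            if lo <= x <= hi:
                points.append((x, y))
    unique = sorted(set(points))
    return (unique[0], unique[-1]) if len(unique) >= 2 else None

def crossing_point(n1, n2):
    """Intersection of the zero lines of two neurons (w1, w2, b); None if parallel."""
    a1, b1, c1 = n1
    a2, b2, c2 = n2
    det = a1 * b2 - a2 * b1
    if det == 0.0:
        return None
    return ((b1 * c2 - b2 * c1) / det, (a2 * c1 - a1 * c2) / det)
```

Checking whether a crossing point falls on a pixel outside the training mask gives a simple way to locate crossings in the region not covered by the training set.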

Let us look at the diagrams of the input spaces of the neurons in the first hidden layer.
Because of the direct relation between the space of the input attributes of the observations
and the input spaces of the first hidden layer neurons, it can be said that in the cases of
both N_i^l and N_i^c the hyperplanes P_j generally concentrate as discussed in Sec. 3.
In particular, in N_i^c, some hyperplanes generally concentrate near the linear features f_l
and some concentrate near the circular features f_c. In effect, the lines in the diagrams
concentrated near f_c cross those concentrated near f_l. Additionally, the crossings occur
partially in the region not covered by the training set. These are exactly the conditions
prone to the random generalization discussed in Sec. 4. In fact, unlike in N_i^l, where the
hyperplanes finely 'extrapolate' the regions in the training set, in the functions of N_i^c a
highly random generalization can be seen.

6 CONCLUSIONS 

It was discussed that the interference of signals within an FNN, while possibly being one of
its strengths, may also cause a substantially random generalization. Tests of generalization
of two sets of data were presented. The obtained generalizing function was relatively
predictable in the case of one of the sets, and there was a high randomness in the function
in the case of the other set.